Abstract

This paper considers the problem of estimating the channel response (or Green's 
function) between multiple source-receiver pairs. Typically, the channel responses are 
estimated one-at-a-time: a single source sends out a known probe signal, the receiver 
measures the probe signal convolved with the channel response, and the responses are 
recovered using deconvolution. In this paper, we show that if the channel responses 
are sparse and the probe signals are random, then we can significantly reduce the 
total amount of time required to probe the channels by activating all of the sources 
simultaneously. With all sources activated simultaneously, the receiver measures a 
superposition of all the channel responses convolved with the respective probe signals. 
Separating this cumulative response into individual channel responses can be posed as 
a linear inverse problem. 

We show that channel response separation is possible (and stable) even when the
probing signals are relatively short in spite of the corresponding linear system of
equations becoming severely underdetermined. We derive a theoretical lower bound on the
length of the source signals that guarantees that this separation is possible with high
probability. The bound is derived by putting the problem in the context of finding a
sparse solution to an underdetermined system of equations, and then using mathematical
tools from the theory of compressive sensing. Finally, we discuss some practical
applications of these results, which include forward modeling for seismic imaging,
channel equalization in multiple-input multiple-output communication, and increasing the
field-of-view in an imaging system by using coded apertures.



1 Introduction 



This paper gives a theoretical treatment to the problem of channel estimation in
multiple-input multiple-output (MIMO) systems. The general scenario is illustrated in Figure 1. A
set of p sources emit different probe signals, which then travel through different channels and
are observed by q receivers. We will assume that the channel between each source/receiver
pair is linear and time-invariant; if source i sends probe signal $\phi_i$, then receiver j observes
the convolution $\phi_i \star h_{i,j}$ of the probe signal with the corresponding channel response $h_{i,j}$.
The goal is to estimate all of the channel responses, and to do so using the smallest total
amount of probing time.



We will focus on the discrete version of this problem. We assume that each channel response
$h_{i,j}$ has length n. If a single source i emits a probe sequence $\{\phi_i(1), \phi_i(2), \ldots, \phi_i(m)\}$ of
length m, receiver j observes^1 the linear convolution

$$ y_j^{\mathrm{lin}} = \Phi_i^{\mathrm{lin}} h_{i,j}, \qquad (1.1) $$

where $\Phi_i^{\mathrm{lin}}$ is the $(m + n - 1) \times n$ convolution matrix built from the probe sequence.
Recovering $h_{i,j}$ from $y_j^{\mathrm{lin}}$ is a classical deconvolution problem. The inverse problem can
be made very well conditioned if $\phi_i$ is chosen carefully; if not, then the inversion can be
regularized using some type of prior information about the channel.
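As a minimal sketch of the single-source model in (1.1), the snippet below builds the $(m + n - 1) \times n$ convolution matrix for one probe and checks that multiplying it by a channel response reproduces the linear convolution. The function names and the numerical values are illustrative assumptions, not part of the paper.

```python
def conv_matrix(phi, n):
    """(m + n - 1) x n Toeplitz matrix whose product with h is the linear convolution of phi and h."""
    m = len(phi)
    # Entry (r, c) is phi[r - c] when that index is valid, and 0 otherwise.
    return [[phi[r - c] if 0 <= r - c < m else 0.0 for c in range(n)]
            for r in range(m + n - 1)]


def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def linear_convolve(phi, h):
    """Direct linear convolution, used here as a reference."""
    y = [0.0] * (len(phi) + len(h) - 1)
    for t, p in enumerate(phi):
        for k, c in enumerate(h):
            y[t + k] += p * c
    return y


phi = [1.0, -2.0, 0.5, 3.0]   # probe sequence, m = 4 (illustrative values)
h = [0.0, 1.0, 0.0]           # sparse channel, n = 3: a single delayed arrival

y_matrix = matvec(conv_matrix(phi, len(h)), h)
y_direct = linear_convolve(phi, h)
print(y_matrix)
```

Because the channel here is a single unit spike at delay one, the observation is simply the probe shifted by one sample and padded with zeros.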

We will measure the cost of the channel estimation by the amount of time we spend probing
the channel, which we can see is proportional to $n + m - 1 = O(n + m)$, the number of rows
in the linear system in (1.1). From a single source, we can estimate the response to all of
the receivers by emitting a single probing sequence and solving (1.1) at each receiver j. If
there are multiple sources, then typically the sources are activated one-at-a-time. In this
case, the total activation time required to estimate all of the $h_{i,j}$ becomes $O(p(n + m))$. In
theory, m could be made as small as we like in this situation, giving us a lower bound on
the cost of $O(np)$.

In this paper, we propose and rigorously analyze an alternative strategy for estimating the 
channels between each source-receiver pair. Our strategy will reduce the total amount of 
time spent on probing the channels by activating all of the sources simultaneously. (This 
approach was first proposed in the context of seismic imaging in a related conference paper 
[28].) Now, of course, the sources will interfere with one another, and the receiver will 
observe a combination of each source convolved with its respective channel. With all p 
sources active, the observations at receiver j can be written as the following system of 
equations 



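$$ y_j^{\mathrm{lin}} = \sum_{i=1}^{p} \Phi_i^{\mathrm{lin}} h_{i,j} = \begin{bmatrix} \Phi_1^{\mathrm{lin}} & \Phi_2^{\mathrm{lin}} & \cdots & \Phi_p^{\mathrm{lin}} \end{bmatrix} \begin{bmatrix} h_{1,j} \\ h_{2,j} \\ \vdots \\ h_{p,j} \end{bmatrix} =: \Phi^{\mathrm{lin}} h_j. \qquad (1.2) $$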



The $\Phi^{\mathrm{lin}}$ in (1.2) is the $(m + n - 1) \times np$ matrix formed by concatenating the source
convolution matrices $\Phi_i^{\mathrm{lin}}$ row-wise; the $h_j$ is the unknown np-vector consisting of the p
channel responses for the path between each source and receiver j. With all of the sources
activated simultaneously, the total cost of the acquisition is $O(n + m)$, but now the channel
responses are interfering with one another. The question now is how long (quantified by
m) the probing sequences must be to reliably "untangle" the individual $h_{i,j}$ from $\Phi^{\mathrm{lin}} h_j$. If
the probing sequences are chosen carefully and in concert with one another, the system in
(1.2) will be invertible for $m \approx np$, again making the total activation time $O(np)$. If we are
interested in recovering all possible channels without making any assumptions about their
structure, we of course cannot have $m < np$.

^1 We have taken m > n here, which does not affect the discussion too much at this point, but will be
important later on.
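To make the simultaneous-source model in (1.2) concrete, here is a small sketch that superimposes p individual convolutions and compares the result with a single product between the concatenated convolution matrix and the stacked channel vector. The sizes and values are illustrative assumptions; with m = 5, n = 3, p = 3 the system has fewer rows than unknowns.

```python
def linear_convolve(phi, h):
    """Linear convolution of a probe sequence with a channel response."""
    y = [0.0] * (len(phi) + len(h) - 1)
    for t, p in enumerate(phi):
        for k, c in enumerate(h):
            y[t + k] += p * c
    return y


def stacked_conv_matrix(probes, n):
    """Place the p convolution matrices side by side: (m + n - 1) rows, n * p columns."""
    m = len(probes[0])
    return [[phi[r - c] if 0 <= r - c < m else 0.0
             for phi in probes for c in range(n)]
            for r in range(m + n - 1)]


# Illustrative values only: p = 3 sources, probes of length m = 5, channels of length n = 3.
probes = [[1.0, 0.0, -1.0, 2.0, 1.0],
          [0.0, 2.0, 1.0, -1.0, 0.0],
          [3.0, -1.0, 0.0, 0.0, 1.0]]
channels = [[1.0, 0.0, 0.0],
            [0.0, 0.0, -2.0],
            [0.0, 1.0, 0.0]]
n = len(channels[0])

# What the receiver sees: the sum of the p individual convolutions.
observed = [sum(vals) for vals in zip(*(linear_convolve(ph, h)
                                        for ph, h in zip(probes, channels)))]

# The same observation written as one matrix times the stacked channel vector.
Phi_lin = stacked_conv_matrix(probes, n)
h_stacked = [c for h in channels for c in h]
via_matrix = [sum(a * b for a, b in zip(row, h_stacked)) for row in Phi_lin]

rows, cols = len(Phi_lin), len(Phi_lin[0])
print(rows, cols, observed)
```

With 7 equations and 9 unknowns, least squares alone cannot identify the channels; this is the regime in which the sparsity of the stacked vector has to be exploited.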

We will show that if the combined channel response $h_j$ is sparse, then the probing sequences
can be significantly shorter than np if they are random. This problem, along with the tools
we will deploy to solve it, is closely related to recent work in the field of compressive sensing
(CS). The theory of CS basically states that vectors $x_0$ with s non-zero components can be
recovered from an underdetermined set of linear measurements $y = \Phi x_0$ if the matrix $\Phi$
is sufficiently diverse (the precise technical conditions are reviewed in detail in Section 2).
The essential contribution of this paper is to show that when the sequences $\{\phi_i(t)\}_t$ consist
of independent and identically distributed Gaussian random variables, the matrix $\Phi^{\mathrm{lin}}$ in
(1.2) meets this criterion for pulse lengths m that are within a poly-logarithmic factor of
the sparsity s. In particular, Theorems 3.1 and 3.3 combined with Proposition 2.1 show
that if the total number of significant components in $h_j$ is s, then it can be recovered from

$$ m \gtrsim s \cdot \log^5(np), $$

reducing the total time the sources are activated to $O(n + s \log^5(np))$. When the channels
are sparse, that is $s \ll np$, then the cost of acquiring all of the channels is not much
more than acquiring a single channel independently. While having the sources activated
simultaneously introduces "cross-talk" between the different channels, the use of different
random codes by each source coupled with the sparse structure of the channels allows us
to separate the cross-talk into its constituent components.

In the remainder of this section, we will discuss some applications of the channel separation
problem and review recent related work. Section 2 provides an overview of sparse recovery
from underdetermined linear measurements. Section 3 carefully states our main theorems,
which provide a sufficient lower bound on the length of the probing signals (in relation
to the number of sources, the length of the channels, and their sparsity) that allows us
to robustly recover $h_j$ from $y_j^{\mathrm{lin}}$ using any number of sparse recovery algorithms. Proofs
of these theorems are given in Sections 4 and 5. The proofs rely heavily on estimates for
random sums of rank-1 matrices, which are overviewed in the Appendix.

1.1 Applications 

For further motivation, we discuss three specific scenarios in which this multichannel
separation problem arises.

Seismic exploration and forward modeling. Subsurface images of the earth are formed 
by activating different points on the surface with acoustic sources, then measuring the 
response at a number of receiver locations. From these recorded responses, a 3D subsurface
model of the earth (consisting of the local velocity of the propagating elastic waves) can 
be reconstructed using what is known as full waveform inversion (FWI). Dense samplings 
for the positions of the acoustic sources lead to higher resolution reconstructions, but also 
longer field acquisitions and more computationally intensive inversion. 
Figure 1: (a) Multiple sources on left, multiple receivers on the right. The black arrow denotes the
direct arrival from source 2 to receiver 3. The colored arrows denote the indirect arrivals caused by
the different reflectors (denoted by small circles). (b) Channel response $h_{2,3}$ between source 2 and
receiver 3.



The theoretical results in this paper suggest that these expenses can be mitigated by activating
the sources simultaneously using different random patterns. In the field, this reduces the
amount of time required for the acquisition. Although the sources will interfere with one
another, the individual responses can be separated afterwards by taking advantage of the
sparsity of each of the channel responses.^2 The source waveforms will have to be longer
than if each of the p sources were activated individually, but the net activation time across
all sources is much smaller than p individual channel probes.

Sparse channel separation can also reduce the amount of computation required for the
inversion. The most expensive step in waveform inversion is testing a candidate model
to see how well it fits the measurements that have been collected. This so-called forward
modeling simulation consists of solving an extremely large PDE. The cost of this simulation
is proportional to the length of the source signals (i.e. the number of time steps required),
but does not depend at all on the number of sources that are active at one time: running
a simulation with a single source active takes just as long per time step as with
many sources active. If we simulate each of the p sources individually, we will need to
run each simulation for $O(n)$ time for a total cost of $O(np)$ time steps (and the cost of
each time step can be extremely high). If we simulate the sources simultaneously, then the
number of time steps in the simulation can be $O(n + s \log^5(np))$. Given the results of the
simulation with simultaneous active sources, we will of course have to recover the individual
channel responses using some type of sparse recovery algorithm (solving the optimization
program in (2.4) below, for example). But the computations required for this recovery are
minor in comparison to the forward modeling simulation, especially given recent progress
in optimization algorithms [2, 15, 19, 39] and the fact that the system $\Phi^{\mathrm{lin}}$ can be applied
quickly using FFTs.

Source separation for seismic exploration is explored in further detail in the companion
publications [28, 29].

Channel estimation in MIMO communications. When information is transmitted
wirelessly, it is often the case that reflections cause there to be multiple paths from the source
to the receiver. Instead of the transmitted waveform, the receiver observes a convolution of
this waveform with a channel response; if the number of reflectors is small, this response
is sparse. To compensate for this multipath effect, the channel is periodically estimated
by having the source emit a known probing sequence that the receiver can subsequently
deconvolve. If there are multiple transmitters and multiple receivers, we can save time by
probing all of the channel pairs simultaneously, and separating the individual responses
using sparse recovery. This approach is particularly useful when the channel is changing
rapidly, a common problem in underwater acoustic communications [9].

^2 Sparse models are common in seismic imaging [10, 34]. In practice, additional gains are realized by going
beyond the setting treated in this paper and modeling the channels jointly, viewing them as cross sections of
a larger 3D image (with an associated sparse transform) rather than as individual sparse channels; see [22],
for example.

Coded aperture imaging. In [24-26], an imaging architecture is introduced to increase
the field-of-view (FOV) of a camera using coded apertures. A coded aperture is a series of
small openings (apertures) whose net effect is to convolve a target image with a sequence
(code) determined by the pattern in which these openings appear. Coded apertures offer a
way around the classical trade-off between aperture size and image brightness; the multiple
apertures overlay many copies of the image at slightly different shifts, making the image
incident on the detector array "bright" and easily recovered (via deconvolution) if the
aperture code is chosen carefully (e.g. the MURA patterns in [17]).

The essential idea from [23] to increase the FOV without sacrificing resolution is illustrated
in Figure 2: the image is broken into p subimages, each of which we would like to recover
to a resolution of n pixels. Rather than measuring each subimage directly, which would 
require a detector array of size m = np, we pass each image through its own coded aperture 
of size m and these coded subimages are combined onto a detector array of size m. The 
task at hand, then, is to recover the full np pixel image from these m measurements. 

This problem also conforms to our multichannel framework;^3 in this case, we have p
sources and one receiver. Here, the known "probe signals" are the coded aperture patterns
and the unknown channels $h_{i,1}$ are the different subimages. The main results of this paper
say that if the entire image is approximately s-sparse, then the size m of the detector array
needs only to be on the order of s (within a log factor) rather than the full resolution
np. If the images we are reconstructing are consecutive frames in a video sequence, the
image sparsity can come from looking at the differences between consecutive frames.

1.2 Relationship to previous work 

We have cast the multichannel separation problem as recovering a sparse vector from an
underdetermined, random system of equations. This general problem has been studied
extensively over the past five years under the name of compressive sensing (CS). The essential
results from this field state that if we observe $y = \Phi x_0$, where $x_0$ is s-sparse and $\Phi$ is an
$m \times N$ matrix that obeys a technical condition called the restricted isometry property (see
(2.3) below), then $x_0$ can be recovered from y even when $m \ll N$, and this recovery is stable
in the presence of measurement noise, and robust against modeling error (i.e. it is effective
even when $x_0$ is not perfectly sparse) [6, 7, 13]. Random matrices that obey this property for
m within a logarithmic factor of s include matrices with independent entries [HE], matrices
that have been subsampled from orthobases consisting of vectors whose energy is almost
evenly distributed between their entries [33], as well as other matrices with more structured
randomness [31, 37]. These measurement systems provide efficient encodings for $x_0$ because
the number of measurements we need to make is roughly proportional to the number of active
elements. The results from Section 3 tell us that matrices formed by concatenating a series
of p random convolutions are another such efficient encoding (with N = np).

^3 Passing an image through a coded aperture has the effect of convolving it with a binary code. The
theoretical results presented in this paper require the code to be Gaussian; this requirement was imposed so
that each convolution could be diagonalized in the Fourier domain, which allows us to apply recent results
from the theory of random matrices to prove Theorems 3.1 and 3.3. In practice, we would expect there to
be little difference between the Gaussian and binary cases.

Figure 2: Sketch of the architecture proposed in [24] for increasing the field of view of a camera.
The image is broken into p subimages, each subimage passes through a different coded aperture,
and the coded subimages are combined on the detector array. The effect of the coded aperture is
to convolve the subimage with an associated code, and so (1.2) models the process mathematically.

Early work on signal processing algorithms using sparse models for channel estimation can
be found in [11] and [16]. In [20], estimation of a single channel using a pulse consisting of a
sequence of independent Gaussian random variables is explored; the mathematical results
of [20] are framed in the language of CS, and the key recovery condition (the restricted
isometry property in (2.3) below) is established for pulse lengths of $O(n + s^2 \log n)$. The
paper [30] shows that the recovery conditions can be improved to $O(n + s \log^3 n)$ when the
observations are noiseless and the channel is exactly s-sparse. Using convolution with a
random pulse to perform compressive sensing was also considered in the context of imaging
in [31] and as a way to handle streaming data in [38]. Results for super-resolved radar
imaging using ideas from CS can be found in [21]. In this paper, the underdetermined system
arises not because we are subsampling a signal after it has been convolved with a pulse,
but by combining the convolutions from multiple channels into one observed sequence.

Multichannel separation also bears some resemblance to the problem of finding the sparsest
decomposition in a union of bases [5, 14, 18, 36]; this resemblance becomes even more
pronounced when we recast the problem using circular convolution (see Section 3.1) and
take m = n. We can think of each convolution matrix as a different basis, and search for
a way to write the measurements as a superposition of a small number of vectors chosen
from these bases. In contrast to previous work on this problem, the bases here are random
and not quite orthogonal (the related paper [32] considers an alternative way to generate
the random pulses so that each of the convolution matrices is exactly orthogonal).

2 Sparse recovery from underdetermined measurements



In the previous section, we set up multiple channel estimation as a linear inverse problem.
Classically, these types of problems are solved using least-squares; the stability of the
solution is almost completely characterized through the eigenvalues of $\Phi^*\Phi$. If for all
$x \in \mathbb{R}^{np}$ we have

$$ (1 - \delta)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta)\|x\|_2^2 \qquad (2.1) $$

for some small $0 < \delta < 1$, then recovering x from $\Phi x$ is well-posed and stable in the presence
of noise. Of course, if $m < np$, then the system will be underdetermined, $\Phi$ has a nonzero
nullspace, and the lower bound in (2.1) cannot hold. It appears that to simultaneously
estimate all of the channel responses, the length m of the probe sequence must exceed np.

Recent results from compressive sensing have told us that if the vector we are trying to
recover is sparse, then a much weaker condition on $\Phi$ is sufficient for well-posed, stable
recovery. In particular, if (2.1) holds for all 2s-sparse vectors x, rather than all $x \in \mathbb{R}^{np}$,
then we will be able to recover $h_j$ from $y_j = \Phi h_j$ about as well as if we had observed the s
largest (most important) entries in $h_j$ directly.

We can make this precise in the following manner. Denote by $B_2^{\Gamma}$ the set of all vectors
$x \in \mathbb{R}^{np}$ that are nonzero only on the set $\Gamma \subset \{1, \ldots, np\}$ and have unit $\ell_2$ norm. For a
square matrix A, we define the $\|\cdot\|_s$ norm as

$$ \|A\|_s = \sup_{|\Gamma| \le s} \; \sup_{x, y \in B_2^{\Gamma}} |y^* A x|, \qquad (2.2) $$

where the supremum is taken over all s-sparse vectors with unit energy. (We use * for the
transpose of a real-valued vector or matrix, or conjugate-transpose for a complex-valued
vector or matrix.) It is easy to see that if

$$ \|I - \Phi^*\Phi\|_s \le \delta_s $$

then

$$ (1 - \delta_s)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_s)\|x\|_2^2 \quad \text{for all s-sparse } x. \qquad (2.3) $$

Establishing (2.3), which has gone by the names uniform uncertainty principle and restricted
isometry property in the CS literature, is the key for stable sparse recovery [3, 6, 8, 27]. The
following proposition gives us a concrete algorithm for recovering a sparse vector from
measurements made by a matrix that satisfies (2.3).

Proposition 2.1 ([6]) Let $x_0$ be an s-sparse vector, and $\Phi$ be a matrix that obeys (2.3)
with $\delta_{C_1 s} \le C_2$, where $C_1 > 1$ and $0 < C_2 < 1$ are constants. Given noisy observations
$y = \Phi x_0 + e$, where e is an error vector with norm at most $\|e\|_2 \le \epsilon$, the solution $\hat{x}_0$ to the
optimization program

$$ \min_x \|x\|_1 \quad \text{subject to} \quad \|\Phi x - y\|_2 \le \epsilon \qquad (2.4) $$

will obey

$$ \|\hat{x}_0 - x_0\|_2 \le C_3 \epsilon, $$

where $C_3$ is a known universal constant. In addition, if $x_0$ is a general non-sparse vector,
then the solution to (2.4) obeys

$$ \|\hat{x}_0 - x_0\|_2 \le C_4 \left( \epsilon + s^{-1/2} \|x_0 - x_{0,s}\|_1 \right), $$

where $x_{0,s}$ is the best s-sparse approximation to $x_0$; the nonzero components in $x_{0,s}$ are the
s largest components of $x_0$.

The constants in the theorem above are known to be small. For example, it is known
that we need $C_1 = 2$ and $C_2 < \sqrt{2} - 1$, and with $C_2 = 1/4$, we have $C_4 \le 6$. Similar
stability results hold for recovery procedures other than $\ell_1$ minimization. In particular,
in [27] and [3], it is shown that particular types of iterative thresholding algorithms can
achieve essentially the same performance after a very reasonable number of iterations.
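As a rough illustration of an iterative thresholding recovery, the sketch below runs iterative soft thresholding (ISTA) on a small random system. This is an assumption-laden stand-in: it minimizes the Lagrangian form $\frac{1}{2}\|\Phi x - y\|_2^2 + \lambda \|x\|_1$ rather than the constrained program (2.4), it is not the specific algorithm of [27] or [3], and the sizes, seed, and value of `lam` are arbitrary choices.

```python
import random


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Apply the transpose of A to r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become zero."""
    return max(abs(v) - t, 0.0) * (1.0 if v > 0 else -1.0)


def objective(A, y, x, lam):
    resid = [ax - yi for ax, yi in zip(matvec(A, x), y)]
    return 0.5 * sum(r * r for r in resid) + lam * sum(abs(v) for v in x)


def ista(A, y, lam, iters):
    # The squared Frobenius norm upper-bounds the squared spectral norm,
    # so this step size is conservative but keeps the objective non-increasing.
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * len(A[0])
    history = [objective(A, y, x, lam)]
    for _ in range(iters):
        resid = [ax - yi for ax, yi in zip(matvec(A, x), y)]
        grad = rmatvec(A, resid)
        x = [soft_threshold(xi - step * g, step * lam) for xi, g in zip(x, grad)]
        history.append(objective(A, y, x, lam))
    return x, history


rng = random.Random(0)
m, N, s = 20, 40, 3   # illustrative sizes: 20 measurements, 40 unknowns, 3 nonzeros
A = [[rng.gauss(0.0, 1.0 / m ** 0.5) for _ in range(N)] for _ in range(m)]
x0 = [0.0] * N
for idx in rng.sample(range(N), s):
    x0[idx] = rng.choice([-1.0, 1.0])
y = matvec(A, x0)

x_hat, history = ista(A, y, lam=0.01, iters=200)
print(history[0], history[-1])
```

The conservative step size makes convergence slow; the point is only to show the structure of a thresholding iteration, with each pass costing two matrix-vector products.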

The main result of this paper, codified in Theorems 3.1 and 3.3 below, is that the $\Phi$ which
arises in the multichannel separation problem will obey the restricted isometry property
(2.3) for s almost proportional (within a log factor) to m.

3 A multichannel separation theorem 
3.1 From linear to circular convolution 

Rather than analyze the spectral properties of $\Phi^{\mathrm{lin}}$ in (1.2) directly, we will replace it with a
slightly modified version whose components are submatrices of large circular matrices, and
thus can be diagonalized in the Fourier domain, which simplifies the analysis considerably.
To do this, we simply "pre-process" the measurements by adding some of them together to
create a slightly shorter observation vector.

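To start, consider the single source measurements $y_j^{\mathrm{lin}}$ in equation (1.1), with the pulse
length m exceeding the length of the channel n. Suppose that we add the first n - 1
measurements to the last n - 1 measurements to form

$$ y_j = \begin{bmatrix} y_j^{\mathrm{lin}}(n) \\ \vdots \\ y_j^{\mathrm{lin}}(m) \\ y_j^{\mathrm{lin}}(m+1) + y_j^{\mathrm{lin}}(1) \\ \vdots \\ y_j^{\mathrm{lin}}(m+n-1) + y_j^{\mathrm{lin}}(n-1) \end{bmatrix} = \begin{bmatrix} \phi_i(n) & \phi_i(n-1) & \cdots & \phi_i(1) \\ \vdots & \vdots & & \vdots \\ \phi_i(m) & \phi_i(m-1) & \cdots & \phi_i(m-n+1) \\ \phi_i(1) & \phi_i(m) & \cdots & \phi_i(m-n+2) \\ \vdots & \vdots & & \vdots \\ \phi_i(n-1) & \phi_i(n-2) & \cdots & \phi_i(m) \end{bmatrix} h_{i,j} =: \Phi_i h_{i,j}. $$

The matrix $\Phi_i$ consists of the first n columns of an $m \times m$ circulant matrix with

$$ r_i^* = [\phi_i(n) \;\cdots\; \phi_i(1) \;\; \phi_i(m) \;\cdots\; \phi_i(n+1)] $$

as the first row. As such, we can use the discrete Fourier transform to diagonalize $\Phi_i$. Let
F be the $m \times m$ normalized discrete Fourier matrix with entries

$$ F(\omega, t) = \frac{1}{\sqrt{m}} e^{-j 2\pi (\omega - 1)(t - 1)/m}, $$

and let $F_{(1:n)}$ denote the $m \times n$ matrix consisting of the first n columns of F. Then

$$ \Phi_i = F^* G_i F_{(1:n)}, \quad \text{with} \quad G_i = \mathrm{diag}(\{g_i(\omega)\}_{\omega=1}^{m}). \qquad (3.1) $$

The vector $g_i(\omega)$ is the (re-normalized) Fourier transform of $r_i$:

$$ g_i = \sqrt{m} \cdot F r_i. $$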
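The folding step can be checked numerically. The sketch below, with illustrative values, adds the last n - 1 samples of a linear convolution onto the first n - 1 and confirms that the result equals a length-m circular convolution computed as a pointwise product in the DFT domain. It uses the natural sample ordering, which may differ from the row ordering in the text by a cyclic shift, and an unnormalized DFT; both are assumptions of this sketch.

```python
import cmath


def linear_convolve(phi, h):
    y = [0.0] * (len(phi) + len(h) - 1)
    for t, p in enumerate(phi):
        for k, c in enumerate(h):
            y[t + k] += p * c
    return y


def fold(y_lin, m):
    """Add the trailing len(y_lin) - m samples back onto the leading ones."""
    y = list(y_lin[:m])
    for t, v in enumerate(y_lin[m:]):
        y[t] += v
    return y


def dft(x):
    """Unnormalized discrete Fourier transform."""
    m = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * w * t / m) for t in range(m))
            for w in range(m)]


def idft(X):
    m = len(X)
    return [sum(X[w] * cmath.exp(2j * cmath.pi * w * t / m) for w in range(m)) / m
            for t in range(m)]


phi = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5]   # probe, m = 6 (illustrative values)
h = [1.0, 0.0, -0.75]                    # channel, n = 3
m = len(phi)

folded = fold(linear_convolve(phi, h), m)

# Circular convolution is diagonal in the DFT domain: multiply spectra pointwise.
h_padded = h + [0.0] * (m - len(h))
spectrum = [a * b for a, b in zip(dft(phi), dft(h_padded))]
circular = [v.real for v in idft(spectrum)]

max_gap = max(abs(a - b) for a, b in zip(folded, circular))
print(max_gap)
```

The pointwise product of spectra is what makes the circulant form convenient: each source contributes a diagonal matrix in the Fourier domain.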

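When all the sources are active simultaneously, we can perform the same manipulations
on the composite linear system (1.2), combining the first n - 1 entries in $y_j^{\mathrm{lin}}$ with the last
n - 1 to yield

$$ y_j = \begin{bmatrix} \Phi_1 & \Phi_2 & \cdots & \Phi_p \end{bmatrix} \begin{bmatrix} h_{1,j} \\ h_{2,j} \\ \vdots \\ h_{p,j} \end{bmatrix} =: \Phi h_j. \qquad (3.2) $$

As in (3.1), we can write $\Phi$ as

$$ \Phi = F^* \begin{bmatrix} G_1 F_{(1:n)} & G_2 F_{(1:n)} & \cdots & G_p F_{(1:n)} \end{bmatrix}. \qquad (3.3) $$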

We assume that each source emits an independent random waveform. That is, we take the
probe samples $\{\phi_i(t)\}_{i,t}$ to be iid Gaussian random variables with zero mean and variance
$m^{-1}$ (so each probing waveform has unit energy in expectation). Since the $\phi_i(t)$ are iid
Gaussian, the corresponding Fourier transforms $g_i(\omega)$ are sequences of conjugate symmetric
complex-valued Gaussian random variables:

$$ g_i(\omega) \sim \begin{cases} \mathrm{Normal}(0, 1) & \omega = 1, \; m/2 + 1 \\ \mathrm{Normal}(0, 1/\sqrt{2}) + j \cdot \mathrm{Normal}(0, 1/\sqrt{2}) & \omega = 2, \ldots, m/2 \end{cases} $$

and $g_i(\omega) = g_i(m - \omega + 2)^*$ for $\omega = m/2 + 2, \ldots, m$.
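The conjugate symmetry and the unit scaling of the $g_i(\omega)$ can be checked on one random draw. In this sketch (illustrative length m = 16 and an arbitrary seed), the probe samples have variance 1/m and the unnormalized DFT plays the role of $\sqrt{m} \cdot F$; indices are 0-based, so the partner of bin w is bin (m - w) mod m.

```python
import cmath
import random


def unnormalized_dft(x):
    m = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * w * t / m) for t in range(m))
            for w in range(m)]


rng = random.Random(1)
m = 16                                                      # illustrative pulse length
r = [rng.gauss(0.0, (1.0 / m) ** 0.5) for _ in range(m)]    # iid, variance 1/m
g = unnormalized_dft(r)                                     # sqrt(m) times the normalized transform

# Real input implies g[w] is the conjugate of g[(m - w) % m].
symmetry_gap = max(abs(g[w] - g[(m - w) % m].conjugate()) for w in range(m))

# The first bin and the middle bin are purely real.
real_bins = (abs(g[0].imag), abs(g[m // 2].imag))

# Parseval: the |g|^2 sum to m times the probe energy, so they average to about 1.
parseval_gap = abs(sum(abs(v) ** 2 for v in g) - m * sum(v * v for v in r))
mean_power = sum(abs(v) ** 2 for v in g) / m
print(symmetry_gap, real_bins, parseval_gap, mean_power)
```

Only m/2 + 1 of the magnitudes are distinct, which is the count used later when bounding the maximum of the $|g_k(\omega)|^2$.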



3.2 Recovery theorems 

Our main theorem shows that the random matrix $\Phi$ generated from the random sequences
$\{g_i(\omega)\}$ as in (3.2), is an approximate restricted isometry in expectation for $m \gtrsim s \log^5(np)$.

Theorem 3.1 Let $\Phi$ be as in (3.3). There exist constants $C_5$ and $C_6$ such that

$$ E\|I - \Phi^*\Phi\|_s \le \sqrt{\frac{C_5 \cdot s \cdot \log^2 s \, \log^2(mp) \, \log(np)}{m}} \qquad (3.4) $$

when

$$ m \ge C_6 \cdot s \cdot \log^2 s \, \log^2(mp) \, \log(np). $$

It is straightforward to turn Theorem 3.1 into a direct statement about the restricted
isometry constants.

Corollary 3.2 There is a constant $C_7$ such that

$$ E\|I - \Phi^*\Phi\|_s \le \delta_s $$

when

$$ m \ge C_7 \, \delta_s^{-2} \cdot s \cdot \log^5(np), \qquad (3.5) $$

for any $0 < \delta_s < 1$, provided that $m \le np$.

To see how (3.5) follows from (3.4), notice that for $m \le np$ and $s \le np$, we have

$$ \log(mp) = \log m + \log p \le 2 \log(np) $$

and so

$$ \sqrt{\frac{C_5 \cdot s \cdot \log^2 s \, \log^2(mp) \, \log(np)}{m}} \le \sqrt{\frac{4 C_5 \cdot s \cdot \log^5(np)}{m}}, $$

and we can choose m as in (3.5).
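The chain of inequalities behind this step is easy to spot-check. The sketch below evaluates the two radicands on a handful of parameter choices with $m \le np$ and $1 \le s \le np$; the constant is set to 1 and the grid of values is arbitrary, both being assumptions made only for illustration.

```python
import math


def theorem_radicand(s, n, m, p, c5=1.0):
    """Quantity under the square root in the expectation bound."""
    return c5 * s * math.log(s) ** 2 * math.log(m * p) ** 2 * math.log(n * p) / m


def corollary_radicand(s, n, m, p, c5=1.0):
    """Simplified upper bound obtained from log(mp) <= 2 log(np) and s <= np."""
    return 4.0 * c5 * s * math.log(n * p) ** 5 / m


cases = [(s, n, m, p)
         for n in (8, 64, 512)
         for p in (2, 16)
         for m in (n, n * p // 2, n * p)
         for s in (1, n, n * p)]

checks = [(math.log(m * p) <= 2 * math.log(n * p),
           theorem_radicand(s, n, m, p) <= corollary_radicand(s, n, m, p))
          for (s, n, m, p) in cases]
print(all(a and b for a, b in checks))
```

The factor 4 comes from squaring the bound $\log(mp) \le 2\log(np)$, and the remaining three logarithms are each at most $\log(np)$.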



Theorem 3.1 gives us a lower bound on the length of a pulse sufficient to endow, in
expectation, $\Phi$ with certain restricted isometry constants. The following theorem gives us a lower
bound for the length of the pulses that guarantees that $\Phi$ has certain isometry constants
with high probability.



Theorem 3.3 Let $\Phi$ and $\delta_s$ be as in Theorem 3.1. There exist constants $C_8$ and $C_9$ such
that

$$ P\{\|I - \Phi^*\Phi\|_s > \delta_s\} \le C_8 (np)^{-1} $$

when

$$ m \ge C_9 \cdot \delta_s^{-2} \cdot s \cdot \log^5(np). \qquad (3.6) $$

It is worth mentioning that we chose a probability of failure of $\sim (np)^{-1}$ mostly out of
convenience. In fact, the probability can be made arbitrarily small by adjusting the constant
$C_9$; we could achieve a failure rate of $(np)^{-\alpha}$ for any $\alpha > 1$ by making the constant in (3.6)
$C_9 \alpha$.

The essential consequence of the next theorem is that for pulse lengths obeying (3.6), we can
simultaneously estimate the channel responses from all sources to receiver j, which are
concatenated in the vector $h_j$, from either concatenated circular convolution observations $\Phi h_j$
or concatenated linear convolution observations $\Phi^{\mathrm{lin}} h_j$. As linear convolution observations
are more typical, we state our channel separation corollary in terms of $\Phi^{\mathrm{lin}}$.



Corollary 3.4 Suppose we observe

$$ y_j^{\mathrm{lin}} = \Phi^{\mathrm{lin}} h_j + e, $$

where $y_j^{\mathrm{lin}}$, $\Phi^{\mathrm{lin}}$, and $h_j$ are as in (1.2) and e is an unknown vector of measurement errors
with $\|e\|_2 \le \epsilon$. Take $C_1$, $C_2$, and $C_4$ as in Proposition 2.1 and take m as in Theorem 3.3 so
that $\delta_{C_1 s} \le C_2$, where $\delta_{C_1 s}$ is the isometry constant for the concatenated circulant matrix
$\Phi$ generated from $\Phi^{\mathrm{lin}}$ as in (3.2). Then the solution $\hat{h}_j$ to

$$ \min_h \|h\|_1 \quad \text{subject to} \quad \|\Phi^{\mathrm{lin}} h - y_j^{\mathrm{lin}}\|_2 \le \epsilon $$

is a close approximation to $h_j$ in that

$$ \|\hat{h}_j - h_j\|_2 \le C_4 \left( \sqrt{2}\,\epsilon + s^{-1/2} \|h_j - h_{j,s}\|_1 \right), \qquad (3.7) $$

where $h_{j,s}$ is the best s-term approximation to $h_j$.

Proof Theorem 3.3 coupled with Proposition 2.1 gives us robust reconstruction for observations
made through the concatenated circulant system $\Phi$. To establish the corollary, we
will make a concrete connection between the solutions to the linear and circular convolution
inverse problems.

First, we consider the case where there is no noise and $h_j$ is perfectly s-sparse. Given
the circular observations $y = \Phi h_j$, we could solve (2.4) with $\epsilon = 0$, making the constraints
$\Phi x = y$. With m as in (3.5), the solution $\hat{h}_j$ will be exactly $h_j$ with high probability. Stated
differently, there is no vector in the nullspace of $\Phi$ that can be added to $h_j$ that lowers the
$\ell_1$ norm. Since $\mathrm{Null}(\Phi^{\mathrm{lin}}) \subset \mathrm{Null}(\Phi)$, we could also solve (2.4) with $y_j^{\mathrm{lin}}$ and $\Phi^{\mathrm{lin}}$ in place of
y and $\Phi$ and recover the signal exactly.

To make the connection when there is noise, we use the following proposition, which is
contained in [H|6], but is slightly stronger than Proposition 2.1.

Proposition 3.5 Under the conditions of Proposition 2.1, if d is any vector that satisfies
$\|x_0 + d\|_1 \le \|x_0\|_1$ and $\|\Phi d\|_2 \le 2\epsilon$ (both of which must be true for $d = \hat{x}_0 - x_0$), then

$$ \|d\|_2 \le C_4 \cdot \left( \epsilon + s^{-1/2} \|x_0 - x_{0,s}\|_1 \right). $$

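Now suppose we solve (2.4) with observations $y_j^{\mathrm{lin}}$ and matrix $\Phi^{\mathrm{lin}}$, denoting the solution
$\hat{h}_j^{\mathrm{lin}}$, and set $d = \hat{h}_j^{\mathrm{lin}} - h_j$. Since $h_j$ is feasible, we will have both $\|h_j + d\|_1 \le \|h_j\|_1$ and
$\|\Phi^{\mathrm{lin}} d\|_2 \le 2\epsilon$. We can write $\Phi = A \Phi^{\mathrm{lin}}$, where A combines the first and last n - 1 elements
of a vector. Since the maximum singular value of A is $\sqrt{2}$, we also have

$$ \|\Phi d\|_2 = \|A \Phi^{\mathrm{lin}} d\|_2 \le \sqrt{2}\, \|\Phi^{\mathrm{lin}} d\|_2 \le 2\sqrt{2}\,\epsilon. $$

Thus the solution $\hat{h}_j^{\mathrm{lin}}$ is as accurate as solving (2.4) with the circulant observations $y_j$ and
matrix $\Phi$ with $\epsilon$ increased by a factor of $\sqrt{2}$. Thus $\hat{h}_j^{\mathrm{lin}}$ will obey (3.7).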
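The claim that the combining operator has maximum singular value $\sqrt{2}$ can be illustrated directly. In the sketch below (illustrative sizes m = 8, n = 3, natural sample ordering assumed), each output of the fold is either a single input or a sum of two inputs, so $(a + b)^2 \le 2(a^2 + b^2)$ caps the gain at $\sqrt{2}$, and a vector with equal weight on two entries that get added together attains it.

```python
import math
import random


def fold(v, m):
    """Combine the trailing len(v) - m entries with the leading ones."""
    out = list(v[:m])
    for t, x in enumerate(v[m:]):
        out[t] += x
    return out


def norm(v):
    return math.sqrt(sum(x * x for x in v))


m, n = 8, 3                      # illustrative pulse and channel lengths
rng = random.Random(2)

# Gain of the folding operator on random inputs never exceeds sqrt(2).
ratios = []
for _ in range(200):
    v = [rng.gauss(0.0, 1.0) for _ in range(m + n - 1)]
    ratios.append(norm(fold(v, m)) / norm(v))

# Worst case: equal weight on two entries that are added together.
worst = [0.0] * (m + n - 1)
worst[0] = worst[m] = 1.0
worst_ratio = norm(fold(worst, m)) / norm(worst)
print(max(ratios), worst_ratio)
```

A reordering of the output entries would not change these gains, so the same bound applies whichever row ordering is used for the circulant system.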



4 Proof of Theorem 3.1

The essential tool for establishing (3.4) is a variation (Lemma A.2) of a lemma due to
Rudelson and Vershynin (Lemma A.1). Most of our efforts will go towards manipulating
$I - \Phi^*\Phi$ to put it in a form where we can apply Lemma A.2. The basic flow of the proof is
to divide $I - \Phi^*\Phi$ into several parts, each of which can be written as a sum of independent
rank-1 matrices, and then apply the bounds in the Appendix to each part. This process
is not difficult, but it is somewhat laborious. To aid the exposition, we have divided the
proof into steps, each one of which accomplishes a particular task.

We will not track the constants. We will use C to denote a constant that is independent
of all the variables of interest (s, n, m, p); the particular value of C may change between
instantiations. We will give a constant a label in the subscript if we want to refer to it later.

To start, we set $Z = I - \Phi^*\Phi$.



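E1. Write Z as a sum of rank-1 matrices. Recall from (3.3) that we can write the
multichannel convolution matrix $\Phi$ as

$$ \Phi = F^* \begin{bmatrix} G_1 F_{(1:n)} & G_2 F_{(1:n)} & \cdots & G_p F_{(1:n)} \end{bmatrix}, $$

where the $G_k$ are diagonal matrices consisting of the re-normalized Fourier transforms
of the sources. We can write $\Phi^*\Phi$ in matrix form as

$$ \Phi^*\Phi = \begin{bmatrix} F_{(1:n)}^* G_1^* \\ F_{(1:n)}^* G_2^* \\ \vdots \\ F_{(1:n)}^* G_p^* \end{bmatrix} F F^* \begin{bmatrix} G_1 F_{(1:n)} & G_2 F_{(1:n)} & \cdots & G_p F_{(1:n)} \end{bmatrix} = \begin{bmatrix} F_{(1:n)}^* G_1^* G_1 F_{(1:n)} & \cdots & F_{(1:n)}^* G_1^* G_p F_{(1:n)} \\ \vdots & \ddots & \vdots \\ F_{(1:n)}^* G_p^* G_1 F_{(1:n)} & \cdots & F_{(1:n)}^* G_p^* G_p F_{(1:n)} \end{bmatrix}, $$

where we have used the fact that $F F^* = I$. We can compact this expression by
introducing $f_{k,\omega} \in \mathbb{C}^{np}$ as the vector which has column $\omega$ of $F_{(1:n)}^*$ in entries
$(k - 1)n + 1, \ldots, kn$ and is zero elsewhere. Then we can rewrite $\Phi^*\Phi$ as

$$ \Phi^*\Phi = \sum_{k=1}^{p} \sum_{j=1}^{p} \sum_{\omega=1}^{m} g_k(\omega)^* g_j(\omega) \, f_{k,\omega} f_{j,\omega}^*. \qquad (4.1) $$

Since

$$ \sum_{k=1}^{p} \sum_{\omega=1}^{m} f_{k,\omega} f_{k,\omega}^* = I, \qquad (4.2) $$

we can now write $Z = I - \Phi^*\Phi$ as

$$ Z = \sum_{k=1}^{p} \sum_{\omega=1}^{m} (1 - |g_k(\omega)|^2) \, f_{k,\omega} f_{k,\omega}^* \; - \; \sum_{k=1}^{p} \sum_{j \ne k} \sum_{\omega=1}^{m} g_k(\omega)^* g_j(\omega) \, f_{k,\omega} f_{j,\omega}^* \; =: H_1 + H_2. \qquad (4.3) $$

Noting that $E\|Z\|_s \le E\|H_1\|_s + E\|H_2\|_s$, we will proceed by bounding each of $E\|H_1\|_s$
and $E\|H_2\|_s$ in turn.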
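The rank-1 expansion in (4.1) and the resolution of the identity in (4.2) are purely algebraic, so they can be verified for small sizes. The sketch below uses illustrative dimensions (m = 6, n = 3, p = 2), 0-based indices, and arbitrary complex values for the diagonal entries, since neither identity depends on their distribution; all names are assumptions of the sketch.

```python
import cmath
import random

m, n, p = 6, 3, 2          # illustrative sizes
N = n * p
rng = random.Random(3)

# Normalized DFT matrix, 0-indexed: F[w][t] = exp(-2*pi*j*w*t/m) / sqrt(m).
F = [[cmath.exp(-2j * cmath.pi * w * t / m) / m ** 0.5 for t in range(m)]
     for w in range(m)]

# Arbitrary complex diagonal entries g[k][w].
g = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(m)] for _ in range(p)]


def f_vec(k, w):
    """Column w of the conjugate transpose of F_(1:n), placed in block k of a length n*p vector."""
    v = [0j] * N
    for t in range(n):
        v[k * n + t] = F[w][t].conjugate()
    return v


def outer_add(M, scale, u, v):
    """In place: M += scale * u v^*."""
    for a in range(len(u)):
        for b in range(len(v)):
            M[a][b] += scale * u[a] * v[b].conjugate()


# Gram matrix computed from Phi = F^* [G_1 F_(1:n) ... G_p F_(1:n)].
Phi = [[sum(F[w][a].conjugate() * g[k][w] * F[w][t] for w in range(m))
        for k in range(p) for t in range(n)]
       for a in range(m)]
gram = [[sum(Phi[a][u].conjugate() * Phi[a][v] for a in range(m)) for v in range(N)]
        for u in range(N)]

expansion = [[0j] * N for _ in range(N)]    # triple sum of rank-1 terms
resolution = [[0j] * N for _ in range(N)]   # sum of f f^*, expected to be the identity
for w in range(m):
    for k in range(p):
        outer_add(resolution, 1.0, f_vec(k, w), f_vec(k, w))
        for j in range(p):
            outer_add(expansion, g[k][w].conjugate() * g[j][w], f_vec(k, w), f_vec(j, w))

gap_expansion = max(abs(gram[u][v] - expansion[u][v]) for u in range(N) for v in range(N))
gap_identity = max(abs(resolution[u][v] - (1.0 if u == v else 0.0))
                   for u in range(N) for v in range(N))
print(gap_expansion, gap_identity)
```

Splitting the triple sum into its k = j and k ≠ j parts and subtracting from the identity gives the two pieces that are bounded separately.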
E2. Bound $E\|H_1\|_s$. We start by making the random variables in the expression for
$H_1$ symmetric. Let $H_1'$ be an independent copy of $H_1$ created from an independent
Gaussian sequence $\{g_k'(\omega)\}_{k,\omega}$, and set

$$ Y = H_1 - H_1' = \sum_{k=1}^{p} \sum_{\omega=1}^{m} \left( |g_k'(\omega)|^2 - |g_k(\omega)|^2 \right) f_{k,\omega} f_{k,\omega}^*. \qquad (4.4) $$

Our strategy is to control $\|Y\|_s$ and use the fact that $E\|H_1\|_s \le E\|Y\|_s$, since

$$ \begin{aligned} E\|H_1\|_s &= E\|H_1 - E[H_1']\|_s && (H_1' \text{ is zero mean}) \\ &= E\big\| E[H_1 - H_1' \mid H_1] \big\|_s && (\text{independence, } E[H_1' \mid H_1] = E[H_1']) \\ &\le E\big[ E[\|H_1 - H_1'\|_s \mid H_1] \big] && (\text{Jensen's inequality}) \\ &= E\|H_1 - H_1'\|_s && (\text{iterated expectation}). \end{aligned} $$



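Next, we randomize the sum in (4.4). Y has the same distribution as

$$ Y' = \sum_{k=1}^{p} \sum_{\omega=1}^{m} \epsilon_k(\omega) \left( |g_k'(\omega)|^2 - |g_k(\omega)|^2 \right) f_{k,\omega} f_{k,\omega}^*, \qquad (4.5) $$

where $\{\epsilon_k(\omega)\}_{k,\omega}$ is an independent Rademacher sequence; the $\epsilon_k(\omega)$ are iid and
take values of $\pm 1$ with equal probability. Note that

$$ E\|Y\|_s = E\|Y'\|_s = E\big[ E[\|Y'\|_s \mid \{g_k(\omega)\}, \{g_k'(\omega)\}] \big]. $$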

Third, apply Lemma [XT] with Vk^ui = I IffU'^)!^ ~ Idki^W \^^'^fk,u)- We define the 
random variable B as 

B :=maxmax{\gk{uj)\,\g'k{uj)\} > maii\\gk{uj)\'^ - \g'k{u})\^ {^^"^ , (4.6) 

k,oj k,uj 

and note that 

lbfc,u;||oo < max| |5rfc(w)P - |5fc(t^)|^ 1^/^ • ||/fc,a;||oo < B / ^/m. 



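With the $\{g_k(\omega)\}$, $\{g_k'(\omega)\}$ fixed, Lemma A.1 (with $M = B/\sqrt{m}$) tells us that

\[
\mathrm{E}\big[\|Y'\|_s \mid \{g_k(\omega)\},\{g_k'(\omega)\}\big] \le \sqrt{\frac{C\cdot s\cdot L}{m}}\cdot B\cdot
\Big\|\sum_{k=1}^{p}\sum_{\omega=1}^{m} \big||g_k'(\omega)|^2 - |g_k(\omega)|^2\big|\, f_{k,\omega} f_{k,\omega}^*\Big\|_s^{1/2},
\]

where $L(s,n,m,p) = \log^2(s)\log(np)\log(mp)$. To make things more compact, we will abbreviate this with $L$, and remember that the quantity depends on the sparsity, length of the channel, length of the pulse, and number of channels. Then by the Cauchy-Schwarz inequality

\[
\mathrm{E}\|Y\|_s \le \sqrt{\frac{C\cdot s\cdot L}{m}}\cdot \left(\mathrm{E}[B^2]\right)^{1/2}\cdot
\Big(\mathrm{E}\Big\|\sum_{k=1}^{p}\sum_{\omega=1}^{m} \big||g_k'(\omega)|^2 - |g_k(\omega)|^2\big|\, f_{k,\omega} f_{k,\omega}^*\Big\|_s\Big)^{1/2}. \tag{4.7}
\]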
We can estimate $\mathrm{E}[B^2]$ as follows. $B^2$ is the maximum of the $|g_k(\omega)|^2$, $|g_k'(\omega)|^2$, which are chi-squared random variables of degree 2 (when $2 \le \omega \le m/2$) or 1 (when $\omega = 1, m/2+1$). In either case, $\mathrm{E}[|g_k(\omega)|^2] = 1$, and

\[
\mathrm{P}\{|g_k(\omega)|^2 > u\} \le e^{-u},
\]

and since there are $(m/2+1)\cdot p\cdot 2 = (m+2)p$ unique magnitudes among the $g_k(\omega)$ and $g_k'(\omega)$,

\[
\mathrm{P}\{B^2 > u\} \le \min\left(1, (m+2)p\cdot e^{-u}\right). \tag{4.8}
\]

Since $B^2$ is a positive random variable,

\[
\begin{aligned}
\mathrm{E}[B^2] &= \int_0^\infty \mathrm{P}\{B^2 > u\}\,du \\
&\le \log((m+2)p) + (m+2)p\int_{\log((m+2)p)}^{\infty} e^{-u}\,du \\
&= \log((m+2)p) + 1. 
\end{aligned} \tag{4.9}
\]
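The bound $\log N + 1$ with $N = (m+2)p$ can be compared against an exactly solvable special case. The sketch below assumes, purely for illustration, that all $N$ magnitudes are independent unit-mean exponentials (the degree-2 chi-squared case after normalization); the expected maximum is then the harmonic number $H_N$. The union bound in (4.8) does not itself require independence, and the function names are hypothetical.

```python
import math

def expected_max_exponential(N):
    """E[max of N iid unit-mean exponentials], which equals the harmonic number H_N."""
    return sum(1.0 / i for i in range(1, N + 1))

def union_bound_value(N):
    """log N + N * (integral of e^{-u} from log N to infinity), in closed form."""
    return math.log(N) + N * math.exp(-math.log(N))

m, p = 64, 4
N = (m + 2) * p
exact = expected_max_exponential(N)   # idealized value of E[B^2]
bound = union_bound_value(N)          # right-hand side of (4.9)
```

In this idealized case the gap between `bound` and `exact` stays below one for every $N$, which suggests that little is lost by splitting the integral at $\log((m+2)p)$.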



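Combining this with the fact that

\[
\mathrm{E}\Big\|\sum_{k=1}^{p}\sum_{\omega=1}^{m} \big||g_k'(\omega)|^2 - |g_k(\omega)|^2\big|\, f_{k,\omega} f_{k,\omega}^*\Big\|_s
\le 2\,\mathrm{E}\Big\|\sum_{k=1}^{p}\sum_{\omega=1}^{m} |g_k(\omega)|^2 f_{k,\omega} f_{k,\omega}^*\Big\|_s,
\]

the bound in (4.7) becomes

\[
\mathrm{E}\|Y\|_s \le \sqrt{\frac{C\cdot s\cdot L\log(mp)}{m}}\cdot
\Big(\mathrm{E}\Big\|\sum_{k}\sum_{\omega} |g_k(\omega)|^2 f_{k,\omega} f_{k,\omega}^*\Big\|_s\Big)^{1/2}.
\]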

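Using (4.2) and the fact that $\|I\|_s = 1$ yields

\[
\begin{aligned}
\mathrm{E}\|Y\|_s &\le \sqrt{\frac{C\cdot s\cdot L\log(mp)}{m}}\cdot
\Big(\mathrm{E}\Big\|\sum_{k}\sum_{\omega} \left(1 - |g_k(\omega)|^2\right) f_{k,\omega} f_{k,\omega}^*\Big\|_s + 1\Big)^{1/2} \\
&= \sqrt{\frac{C\cdot s\cdot L\log(mp)}{m}}\cdot \left(\mathrm{E}\|H_1\|_s + 1\right)^{1/2} \\
&\le \sqrt{\frac{C\cdot s\cdot L\log(mp)}{m}}\cdot \left(\mathrm{E}\|Y\|_s + 1\right)^{1/2}.
\end{aligned} \tag{4.10}
\]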



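Invoking Lemma B.1 with $\beta = \mathrm{E}\|Y\|_s$, $\alpha = \sqrt{C s L\log(mp)/m}$, and $c = 0$, we see that there exist constants $C_{10}, C_{11}$ such that when

\[
m \ge C_{10}\cdot s\cdot L\log(mp)
\]

we will have

\[
\mathrm{E}\|H_1\|_s \le \mathrm{E}\|Y\|_s \le C_{11}\cdot\sqrt{\frac{s\cdot L\log(mp)}{m}}. \tag{4.11}
\]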
E3. Decouple $H_2$. Set

\[
H_2' = \sum_{j\neq k}\sum_{\omega=1}^{m} g_k(\omega)^* g_j'(\omega)\, f_{k,\omega} f_{j,\omega}^*, \tag{4.12}
\]

where $\{g_k'(\omega)\}$ is an independent sequence of random variables with the same distribution as $\{g_k(\omega)\}$. We can now control $\|H_2\|_s$ by controlling $\|H_2'\|_s$, because

\[
\mathrm{E}\|H_2\|_s \le C_{12}\,\mathrm{E}\|H_2'\|_s. \tag{4.13}
\]

For proof of (4.13), see [12, Section 3.1], which also provides a precise value for the constant $C_{12}$.

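E4. Add back the diagonal. Write

\[
H_2' = \sum_{j=1}^{p}\sum_{k=1}^{p}\sum_{\omega=1}^{m} g_k(\omega)^* g_j'(\omega)\, f_{k,\omega} f_{j,\omega}^*
\;-\; \sum_{k=1}^{p}\sum_{\omega=1}^{m} g_k(\omega)^* g_k'(\omega)\, f_{k,\omega} f_{k,\omega}^*
\;=:\; H_3 + H_4. \tag{4.14}
\]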



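E5. Bound $\mathrm{E}\|H_4\|_s$. Denoting the angle of the complex number $g_k(\omega)^* g_k'(\omega)$ as $\theta_k(\omega)$, $H_4$ has the same distribution as

\[
H_4' = \sum_{k}\sum_{\omega} \epsilon_k(\omega)\, u_{k,\omega} v_{k,\omega}^*, \qquad
u_{k,\omega} = e^{j\theta_k(\omega)/2}|g_k(\omega)|\, f_{k,\omega}, \quad
v_{k,\omega} = e^{-j\theta_k(\omega)/2}|g_k'(\omega)|\, f_{k,\omega}, \tag{4.15}
\]

where $\{\epsilon_k(\omega)\}$ is an independent Rademacher sequence. With $B$ as in (4.6), it follows that

\[
\|u_{k,\omega}\|_\infty \le B/\sqrt{m} \quad\text{and}\quad \|v_{k,\omega}\|_\infty \le B/\sqrt{m}.
\]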

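With the $\{g_k(\omega)\}$, $\{g_k'(\omega)\}$ fixed, we apply Lemma A.2 to get:

\[
\mathrm{E}\big[\|H_4'\|_s \mid \{g_k(\omega)\},\{g_k'(\omega)\}\big] \le \sqrt{\frac{C\cdot s\cdot L}{m}}\cdot B\cdot
\left(\Big\|\sum_{k=1}^{p}\sum_{\omega=1}^{m} |g_k(\omega)|^2 f_{k,\omega} f_{k,\omega}^*\Big\|_s^{1/2}
+ \Big\|\sum_{k=1}^{p}\sum_{\omega=1}^{m} |g_k'(\omega)|^2 f_{k,\omega} f_{k,\omega}^*\Big\|_s^{1/2}\right).
\]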



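As in (4.7), we use the law of iterated expectation and the Cauchy-Schwarz inequality to remove the conditioning:

\[
\mathrm{E}\|H_4'\|_s \le \sqrt{\frac{C\cdot s\cdot L}{m}}\cdot\left(\mathrm{E}[B^2]\right)^{1/2}\cdot
\left(\mathrm{E}\left(\Big\|\sum_{k=1}^{p}\sum_{\omega=1}^{m} |g_k(\omega)|^2 f_{k,\omega} f_{k,\omega}^*\Big\|_s^{1/2}
+ \Big\|\sum_{k=1}^{p}\sum_{\omega=1}^{m} |g_k'(\omega)|^2 f_{k,\omega} f_{k,\omega}^*\Big\|_s^{1/2}\right)^{2}\right)^{1/2}.
\]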



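The $\{g_k(\omega)\}$ and $\{g_k'(\omega)\}$ are identically distributed, and so using Jensen's inequality:

\[
\left(\mathrm{E}\left(\Big\|\sum_{k}\sum_{\omega} |g_k(\omega)|^2 f_{k,\omega} f_{k,\omega}^*\Big\|_s^{1/2}
+ \Big\|\sum_{k}\sum_{\omega} |g_k'(\omega)|^2 f_{k,\omega} f_{k,\omega}^*\Big\|_s^{1/2}\right)^{2}\right)^{1/2}
\le 2\left(\mathrm{E}\Big\|\sum_{k=1}^{p}\sum_{\omega=1}^{m} |g_k(\omega)|^2 f_{k,\omega} f_{k,\omega}^*\Big\|_s\right)^{1/2}.
\]

Using the bound in (4.10) in Step E2 we have

\[
\left(\mathrm{E}\Big\|\sum_{k=1}^{p}\sum_{\omega=1}^{m} |g_k(\omega)|^2 f_{k,\omega} f_{k,\omega}^*\Big\|_s\right)^{1/2}
\le \left(\mathrm{E}\Big\|\sum_{k=1}^{p}\sum_{\omega=1}^{m} \left(1 - |g_k(\omega)|^2\right) f_{k,\omega} f_{k,\omega}^*\Big\|_s + 1\right)^{1/2}
= \left(\mathrm{E}\|H_1\|_s + 1\right)^{1/2}.
\]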



Similar to (4.11) (but with different constants) we see that $m \ge C\cdot s\cdot L\log(mp)$ implies $\mathrm{E}\|H_1\|_s \le 1$. As we reasoned in (4.9) above, we will also have $\mathrm{E}[B^2] \le \log((m+2)p) + 1$, and so there are constants $C_{13}, C_{14}$ such that

\[
\mathrm{E}\|H_4\|_s \le C_{14}\cdot\sqrt{\frac{s\cdot L\log(mp)}{m}} \tag{4.16}
\]

when

\[
m \ge C_{13}\cdot s\cdot L\log(mp).
\]
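E6. Bound $\mathrm{E}\|H_3\|_s$. $H_3$ has the same distribution as

\[
H_3' = \sum_{j=1}^{p}\sum_{k=1}^{p}\sum_{\omega=1}^{m} \epsilon_\omega\, g_k(\omega)^* g_j'(\omega)\, f_{k,\omega} f_{j,\omega}^*
= \sum_{\omega=1}^{m} \epsilon_\omega \Big(\sum_{k=1}^{p} g_k(\omega)^* f_{k,\omega}\Big)\Big(\sum_{j=1}^{p} g_j'(\omega)^* f_{j,\omega}\Big)^*
= \sum_{\omega=1}^{m} \epsilon_\omega\, u_\omega v_\omega^*, \tag{4.17}
\]

where

\[
u_\omega = \sum_{k=1}^{p} g_k(\omega)^* f_{k,\omega} \quad\text{and}\quad v_\omega = \sum_{j=1}^{p} g_j'(\omega)^* f_{j,\omega}.
\]

The $f_{k,\omega}$ have disjoint support for different values of $k$, so $\|u_\omega\|_\infty \le B/\sqrt{m}$ and $\|v_\omega\|_\infty \le B/\sqrt{m}$, where $B$ is defined as in (4.6). Also note that

\[
\sum_{\omega=1}^{m} u_\omega u_\omega^* = \sum_{\omega=1}^{m}\sum_{k=1}^{p}\sum_{j=1}^{p} g_k(\omega)^* g_j(\omega)\, f_{k,\omega} f_{j,\omega}^*,
\]

and so recalling (4.1), we see that $\sum_{\omega} u_\omega u_\omega^*$ and $\sum_{\omega} v_\omega v_\omega^*$ are independent realizations of $\Phi^*\Phi$. Lemma A.2 and Cauchy-Schwarz tell us that

\[
\begin{aligned}
\mathrm{E}\|H_3\|_s = \mathrm{E}\|H_3'\|_s &\le \sqrt{\frac{C\cdot s\cdot L'}{m}}\cdot\left(\mathrm{E}[B^2]\right)^{1/2}\cdot\left(\mathrm{E}\|\Phi^*\Phi\|_s\right)^{1/2} \\
&\le C_{15}\cdot\sqrt{\frac{s\cdot L'\log(mp)}{m}}\cdot\left(\mathrm{E}\|Z\|_s + 1\right)^{1/2},
\end{aligned} \tag{4.18}
\]

where $L' := L'(s,n,m,p) := \log^2 s\,\log m\,\log np$. Since $L'(s,n,m,p) \le L(s,n,m,p)$, we can replace $L'$ with $L$ above.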
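E7. Collect the results. To summarize, we have shown that

\[
\begin{aligned}
\mathrm{E}\|Z\|_s &\le \mathrm{E}\|H_1\|_s + \mathrm{E}\|H_2\|_s && \text{(from (4.3) in Step E1)} \\
&\le \mathrm{E}\|H_1\|_s + C_{12}\,\mathrm{E}\|H_2'\|_s && \text{(from (4.13) in Step E3)} \\
&\le \mathrm{E}\|H_1\|_s + C_{12}C_{15}\,\mathrm{E}\|H_3\|_s + C_{12}C_{14}\,\mathrm{E}\|H_4\|_s && \text{(from (4.14) in Step E4).}
\end{aligned}
\]

For $m \ge \max(C_{10}, C_{13})\cdot s\cdot L\log(mp)$, we also have the bounds

\[
\begin{aligned}
\mathrm{E}\|H_1\|_s &\le C_{11}\sqrt{\frac{s\cdot L\log(mp)}{m}} && \text{(from (4.11) in Step E2)} \\
\mathrm{E}\|H_4\|_s &\le C_{14}\sqrt{\frac{s\cdot L\log(mp)}{m}} && \text{(from (4.16) in Step E5)} \\
\mathrm{E}\|H_3\|_s &\le C_{15}\sqrt{\frac{s\cdot L\log(mp)}{m}}\cdot\left(\mathrm{E}\|Z\|_s + 1\right)^{1/2} && \text{(from (4.18) in Step E6).}
\end{aligned}
\]

Thus

\[
\mathrm{E}\|Z\|_s \le C\sqrt{\frac{s\cdot L\log(mp)}{m}}\cdot\left(2 + \sqrt{\mathrm{E}\|Z\|_s + 1}\right).
\]

Using Lemma B.1 we see that there is indeed a constant $C_9$ such that when $m \ge C_9\cdot sL\log(mp)$, we will have

\[
\mathrm{E}\|I - \Phi^*\Phi\|_s = \mathrm{E}\|Z\|_s \le C\cdot\sqrt{\frac{s\cdot L\log(mp)}{m}}. \tag{4.19}
\]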
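For very small problems, the norm $\|\cdot\|_s$ that appears in these bounds can be evaluated by brute force. The sketch below assumes the reading of $\|A\|_s$ suggested by the index set $T$ in Appendix A, namely the largest operator norm of $A$ restricted to rows and columns in a common support $\Gamma$ with $|\Gamma| \le s$. The function name `sparse_norm` is hypothetical, and the enumeration cost grows combinatorially, so this is only practical for toy dimensions.

```python
import itertools
import numpy as np

def sparse_norm(A, s):
    """Largest spectral norm of A restricted to Gamma x Gamma over supports |Gamma| = s.

    Supports smaller than s never give a larger value, because the spectral
    norm of a submatrix is at most that of the matrix containing it.
    """
    d = A.shape[0]
    best = 0.0
    for gamma in itertools.combinations(range(d), s):
        idx = np.array(gamma)
        best = max(best, np.linalg.norm(A[np.ix_(idx, idx)], 2))
    return best

# Rank-one example: a vector with n entries of modulus 1/sqrt(m).
m, n, s = 16, 6, 3
rng = np.random.default_rng(1)
f = np.exp(2j * np.pi * rng.random(n)) / np.sqrt(m)
rank_one = np.outer(f, f.conj())
value = sparse_norm(rank_one, s)
```

For this rank-one matrix the value should equal $s/m$, which is the quantity $\|f_{k,\omega}f_{k,\omega}^*\|_s$ used in the tail-bound steps of Section 5.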



5 Proof of Theorem 3.3



We begin with a brief overview of the steps we will use to establish Theorem 3.3. We will use the same decomposition of $Z$ as in Section 4, dividing $Z$ into $Z = H_1 + H_2$, decoupling $H_2$ to get $H_2'$, then dividing $H_2'$ into $H_2' = H_3 + H_4$. The essential idea is that we have estimated the means of $\|H_1\|_s$, $\|H_3\|_s$, and $\|H_4\|_s$ in the previous section; we will use these estimates and the concentration inequality in Lemma A.4 to derive a tail bound for each of these components in turn.

The main nuisance is that while we can write $H_1$, $H_3$, and $H_4$ as sums of independent random rank-1 matrices, the norms of these matrices are not bounded (as Gaussian random variables are not bounded). To handle this, we define the random variable $B$ as in Section 4,

\[
B = \max_{k,\omega}\max\{|g_k(\omega)|, |g_k'(\omega)|\},
\]

and then derive an estimate for $\|Z\|_s$ conditioned on the event

\[
\mathcal{M} = \{B^2 \le M\},
\]

where we will choose $M$ so that $\mathcal{M}$ is likely to occur: $1/2 \ll \mathrm{P}\{\mathcal{M}\} < 1$. We will use $\mathrm{E}_{\mathcal{M}}$ and $\mathrm{P}_{\mathcal{M}}$ to denote expectation and probability conditioned on the event $\mathcal{M}$ occurring.

We start by decomposing the tail bound as

\[
\mathrm{P}\{\|Z\|_s > \delta\} \le \mathrm{P}\{\|H_1\|_s > \delta/2\} + \mathrm{P}\{\|H_2\|_s > \delta/2\}.
\]

Step P1 below bounds $\mathrm{P}\{\|H_1\|_s > \delta/2\}$. Step P2 decouples the sum for $H_2$ to get

\[
\mathrm{P}\{\|H_2\|_s > \delta/2\} \le C\,\mathrm{P}\{\|H_2'\|_s > \delta/2C\}.
\]

Steps P3 and P4 then condition on $\mathcal{M}$,

\[
C\,\mathrm{P}\{\|H_2'\|_s > \delta/2C\} \le C\left(\mathrm{P}_{\mathcal{M}}\{\|H_2'\|_s > \delta/2C\} + \mathrm{P}\{\mathcal{M}^c\}\right),
\]

divide $H_2'$ into $H_2' = H_3 + H_4$,

\[
\mathrm{P}_{\mathcal{M}}\{\|H_2'\|_s > \delta/2C\} \le \mathrm{P}_{\mathcal{M}}\{\|H_3\|_s > \delta/4C\} + \mathrm{P}_{\mathcal{M}}\{\|H_4\|_s > \delta/4C\},
\]

and then bound $\mathrm{P}_{\mathcal{M}}\{\|H_3\|_s > \delta/4C\}$ and $\mathrm{P}_{\mathcal{M}}\{\|H_4\|_s > \delta/4C\}$ in turn. These individual results are unified to finally establish the theorem in Step P5.

We will control each probability with a parameter $\gamma$, which can be selected as $0 < \gamma < 1/2$, and derive a bound for $m$ so that the total probability of failure is $O(\gamma)$.

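P1. Tail bound for $\|H_1\|_s$. Recall the definitions of $Y = H_1 - H_1'$ and $Y'$, which has the same distribution as $Y$, from (4.4) and (4.5). We can develop a tail bound for $\|H_1\|_s$ from a tail bound for $\|Y\|_s$ by following [23, Section 6.1]. For any $a, \lambda > 0$,

\[
\mathrm{P}\{\|H_1'\|_s \le a\}\,\mathrm{P}\{\|H_1\|_s > a + \lambda\} \le \mathrm{P}\{\|Y\|_s > \lambda\}.
\]

In particular, if we take $a = 2\,\mathrm{E}\|H_1\|_s = 2\,\mathrm{E}\|H_1'\|_s$, we will have $\mathrm{P}\{\|H_1'\|_s \le a\} \ge 1/2$, since the median of a positive random variable is no more than twice its mean, and so

\[
\mathrm{P}\{\|H_1\|_s > 2\,\mathrm{E}\|H_1\|_s + \lambda\} \le 2\,\mathrm{P}\{\|Y\|_s > \lambda\} = 2\,\mathrm{P}\{\|Y'\|_s > \lambda\}.
\]

To bound the right hand side, we first condition on $\mathcal{M}$:

\[
\mathrm{P}\{\|Y'\|_s > \lambda\} \le \mathrm{P}_{\mathcal{M}}\{\|Y'\|_s > \lambda\} + \mathrm{P}\{\mathcal{M}^c\}.
\]

Conditioned on $\mathcal{M}$, each term in the sum that comprises $Y'$ has bounded norm, and so we can apply Lemma A.4 with $V_{k,\omega} = \left(|g_k'(\omega)|^2 - |g_k(\omega)|^2\right) f_{k,\omega} f_{k,\omega}^*$, noting that

\[
\|V_{k,\omega}\|_s \le M\cdot\|f_{k,\omega} f_{k,\omega}^*\|_s = M\cdot s/m.
\]

This yields

\[
\mathrm{P}_{\mathcal{M}}\left\{\|Y'\|_s > C\left(u\,\mathrm{E}_{\mathcal{M}}\|Y'\|_s + t\,\frac{Ms}{m}\right)\right\} \le e^{-u^2} + e^{-t}.
\]

From (4.11), we know that

\[
\mathrm{E}_{\mathcal{M}}\|Y'\|_s \le \frac{\mathrm{E}\|Y'\|_s}{\mathrm{P}\{\mathcal{M}\}} \le C\sqrt{\frac{s\cdot L\log(mp)}{m}}
\]

when $m \ge C\cdot sL\log(mp)$. Plugging this into the expression for the tail bound gives us

\[
\mathrm{P}_{\mathcal{M}}\left\{\|Y'\|_s > C\left(u\sqrt{\frac{s\cdot L\log(mp)}{m}} + t\,\frac{Ms}{m}\right)\right\} \le e^{-u^2} + e^{-t}.
\]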
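Take $u = \sqrt{\log(1/\gamma)}$, $t = \log(1/\gamma)$ to get

\[
\mathrm{P}_{\mathcal{M}}\{\|Y'\|_s > \lambda\} \le 2\gamma, \qquad
\lambda = C\left(\sqrt{\frac{s\cdot L\log(mp)\log(1/\gamma)}{m}} + \log(1/\gamma)\frac{Ms}{m}\right). \tag{5.1}
\]

With this value of $\lambda$, we can use the bound (4.11) for $\mathrm{E}\|H_1\|_s$ to get

\[
2\,\mathrm{E}\|H_1\|_s + \lambda \le C\left(\sqrt{\frac{s\cdot L\log(mp)}{m}} + \sqrt{\frac{s\cdot L\log(mp)\log(1/\gamma)}{m}} + \log(1/\gamma)\frac{Ms}{m}\right).
\]

Since we are choosing $m$ to make all three terms above less than 1, the middle term will dominate. We see that there is a constant $C_{16}$ such that

\[
m \ge C_{16}\cdot\delta_s^{-2}\cdot s\cdot LM\log(1/\gamma)
\]

implies

\[
2\,\mathrm{E}\|H_1\|_s + \lambda \le \delta_s/2,
\]

and hence

\[
\mathrm{P}\{\|H_1\|_s > \delta_s/2\} \le 4\gamma + 2\,\mathrm{P}\{\mathcal{M}^c\}. \tag{5.2}
\]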



P2. Decouple $H_2$. In Step E3 we saw that we could decouple $H_2$ and add back the diagonal, giving us the decomposition (4.14). We can also derive a tail bound for $\|H_2\|_s$ using the fact that

\[
\mathrm{P}\{\|H_2\|_s > \lambda\} \le C_{17}\cdot\mathrm{P}\{\|H_2'\|_s > \lambda/C_{17}\}
\]

for a universal constant $C_{17}$, where $H_2'$ is the "decoupled" version of $H_2$ given by (4.12) (for proof of this and an explicit value for $C_{17}$, see [12, Section 3.4]). We will decompose $H_2' = H_3 + H_4$ as in (4.14) and proceed by finding tail bounds for $\|H_3\|_s$ and $\|H_4\|_s$ conditioned on $\mathcal{M}$.

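P3. Conditional tail bound for $\|H_4\|_s$. We start with the tail bound for $\|H_4\|_s$. Recall that $H_4$ has the same distribution as $H_4'$ in (4.15). Using (4.16) from Step E5, we can bound the conditional mean

\[
\mathrm{E}_{\mathcal{M}}\|H_4'\|_s \le \frac{\mathrm{E}\|H_4'\|_s}{\mathrm{P}\{\mathcal{M}\}} \le C\sqrt{\frac{s\cdot L\log(mp)}{m}}.
\]

Recall that we can write $H_4'$ as a random sum of rank-1 matrices as shown in (4.15). Conditioned on $\mathcal{M}$,

\[
\|u_{k,\omega} v_{k,\omega}^*\|_s \le M\,\|f_{k,\omega} f_{k,\omega}^*\|_s = M\cdot s/m.
\]

We now apply the concentration inequality (A.7) as before with $u = \sqrt{\log(C_{17}/\gamma)}$ and $t = \log(C_{17}/\gamma)$:

\[
\mathrm{P}_{\mathcal{M}}\left\{\|H_4'\|_s > C\left(\sqrt{\frac{s\cdot L\log(mp)\log(C_{17}/\gamma)}{m}} + \log(C_{17}/\gamma)\frac{Ms}{m}\right)\right\} \le \frac{2\gamma}{C_{17}}.
\]

Since we are making both terms inside the probability brackets less than one, the first one will dominate. Thus there exists a constant $C_{18}$ so that

\[
m \ge C_{18}\cdot\delta_s^{-2}\cdot s\cdot LM\log(1/\gamma)
\]

implies

\[
\mathrm{P}_{\mathcal{M}}\{\|H_4'\|_s > \delta_s/4C_{17}\} \le 2\gamma/C_{17},
\]

and finally

\[
\mathrm{P}_{\mathcal{M}}\{\|H_4\|_s > \delta_s/4C_{17}\} \le 2\gamma/C_{17}, \tag{5.3}
\]

since $H_4$ and $H_4'$ have the same distribution.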

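P4. Conditional tail bound for $\|H_3\|_s$. As in Step E6, $H_3$ has the same distribution as

\[
H_3' = \sum_{\omega=1}^{m} \epsilon_\omega\, u_\omega v_\omega^*,
\]

with $u_\omega, v_\omega$ as in (4.17). In (4.18) in Step E6 we showed that

\[
\mathrm{E}\|H_3'\|_s \le C\sqrt{\frac{s\cdot L\log(mp)}{m}}\cdot\left(\mathrm{E}\|Z\|_s + 1\right)^{1/2},
\]

and in (4.19), we showed that $\mathrm{E}\|Z\|_s \le 1$ for $m \ge CsL\log(mp)$. So for this range of $m$, we have $\mathrm{E}\|H_3'\|_s \le Cm^{-1/2}(sL\log(mp))^{1/2}$ and so

\[
\mathrm{E}_{\mathcal{M}}\|H_3'\|_s \le \frac{\mathrm{E}\|H_3'\|_s}{\mathrm{P}\{\mathcal{M}\}} \le C\sqrt{\frac{s\cdot L\log(mp)}{m}}.
\]

Conditioned on $\mathcal{M}$,

\[
\|u_\omega v_\omega^*\|_s \le \sup_{|\Gamma|\le s}|y^* u_\omega|\cdot\sup_{|\Gamma|\le s}|x^* v_\omega| \le M\cdot s/m,
\]

where the suprema are over unit vectors $x, y$ supported on $\Gamma$. We apply the concentration inequality (A.7) with $u = \sqrt{\log(C_{17}/\gamma)}$ and $t = \log(C_{17}/\gamma)$, yielding

\[
\mathrm{P}_{\mathcal{M}}\left\{\|H_3'\|_s > C\left(\sqrt{\frac{s\cdot L\log(mp)\log(C_{17}/\gamma)}{m}} + \log(C_{17}/\gamma)\frac{Ms}{m}\right)\right\} \le \frac{2\gamma}{C_{17}}.
\]

Below we will see that we can take $M \sim \log(mp/\gamma)$; this means that there exists a constant $C_{19}$ such that

\[
m \ge C_{19}\cdot\delta_s^{-2}\cdot s\cdot LM\log(1/\gamma)
\]

implies

\[
\mathrm{P}_{\mathcal{M}}\{\|H_3\|_s > \delta_s/4C_{17}\} \le 2\gamma/C_{17}. \tag{5.4}
\]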
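P5. Collect the tail bounds. We have shown that

\[
\begin{aligned}
\mathrm{P}\{\|Z\|_s > \delta_s\} &\le \mathrm{P}\{\|H_1\|_s > \delta_s/2\} + \mathrm{P}\{\|H_2\|_s > \delta_s/2\} \\
&\le \mathrm{P}\{\|H_1\|_s > \delta_s/2\} + C_{17}\,\mathrm{P}\{\|H_2'\|_s > \delta_s/2C_{17}\} \\
&\le \mathrm{P}\{\|H_1\|_s > \delta_s/2\} + C_{17}\,\mathrm{P}_{\mathcal{M}}\{\|H_2'\|_s > \delta_s/2C_{17}\} + C_{17}\,\mathrm{P}\{\mathcal{M}^c\} \\
&\le \mathrm{P}\{\|H_1\|_s > \delta_s/2\} + C_{17}\,\mathrm{P}_{\mathcal{M}}\{\|H_3\|_s > \delta_s/4C_{17}\} + C_{17}\,\mathrm{P}_{\mathcal{M}}\{\|H_4\|_s > \delta_s/4C_{17}\} + C_{17}\,\mathrm{P}\{\mathcal{M}^c\}.
\end{aligned}
\]

Combining the results from Steps P1, P3 and P4, we see that for any $0 < \gamma < 1/2$, there is a constant $C_{20}$ such that when

\[
m \ge C_{20}\cdot\delta_s^{-2}\cdot s\cdot LM\log(1/\gamma)
\]

we will have all of the following:

\[
\begin{aligned}
\mathrm{P}\{\|H_1\|_s > \delta_s/2\} &\le 4\gamma + 2\,\mathrm{P}\{\mathcal{M}^c\} && \text{(from (5.2) in Step P1)} \\
C_{17}\,\mathrm{P}_{\mathcal{M}}\{\|H_4\|_s > \delta_s/4C_{17}\} &\le 2\gamma && \text{(from (5.3) in Step P3)} \\
C_{17}\,\mathrm{P}_{\mathcal{M}}\{\|H_3\|_s > \delta_s/4C_{17}\} &\le 2\gamma && \text{(from (5.4) in Step P4).}
\end{aligned}
\]

It remains to fix $M$. As $\mathrm{P}\{B^2 > M\} \le \min(1, (m+2)p\cdot e^{-M})$ (recall (4.8)), choosing

\[
M = \log(C_{17}(m+2)p/\gamma) \quad\Longrightarrow\quad C_{17}\,\mathrm{P}\{\mathcal{M}^c\} \le \gamma.
\]

With this choice of $M$ (and assuming that $C_{17} \ge 2$),

\[
\mathrm{P}\{\|Z\|_s > \delta_s\} \le 10\gamma \tag{5.5}
\]

when

\[
m \ge C\cdot\delta_s^{-2}\cdot s\cdot L\log(mp/\gamma)\log(1/\gamma). \tag{5.6}
\]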
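The constant $10$ in (5.5) can be traced by adding up the individual contributions. The following worked step is a sketch of that accounting, using only the bounds (5.2), (5.3), (5.4), the tail estimate (4.8) for $B^2$, and the assumption $C_{17} \ge 2$ stated in Step P5.

```latex
% Choice of M: (m+2)p e^{-M} = \gamma / C_{17}, so C_{17} P{M^c} <= \gamma.
\begin{aligned}
\mathrm{P}\{\|Z\|_s > \delta_s\}
 &\le \underbrace{4\gamma + 2\,\mathrm{P}\{\mathcal{M}^c\}}_{\text{(5.2)}}
    + \underbrace{2\gamma}_{\text{(5.4)}}
    + \underbrace{2\gamma}_{\text{(5.3)}}
    + C_{17}\,\mathrm{P}\{\mathcal{M}^c\} \\
 &= 8\gamma + \Big(1 + \frac{2}{C_{17}}\Big)\, C_{17}\,\mathrm{P}\{\mathcal{M}^c\} \\
 &\le 8\gamma + 2\gamma \;=\; 10\gamma ,
\end{aligned}
% using 1 + 2/C_{17} <= 2 when C_{17} >= 2, and C_{17} P{M^c} <= \gamma.
```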
We establish the theorem by taking 7 = C{np)~^ and noting that 

Llog(mp/7) log(l/7) < log^(np) 

and so taking $m$ as in (3.6) will guarantee (5.6) and hence (5.5).



A Random Matrices 



A.1 Random sums of rank-1 matrices



The theoretical results in this paper depend crucially on our ability to estimate the size of the $\|\cdot\|_s$ norm of random matrices that can be written as the sum of independent rank-1 matrices:

\[
\Big\|\sum_{i=1}^{m} \epsilon_i u_i v_i^*\Big\|_s, \tag{A.1}
\]

where the $u_i$ and $v_i$ are vectors in $\mathbb{C}^n$ and the $\{\epsilon_i\}$ are iid Bernoulli random variables taking the values $\pm 1$ with equal probability. Taking $U, V$ as the $n\times m$ matrices with the $u_i, v_i$ as columns, and letting $\Sigma$ be the diagonal matrix with $\Sigma_{ii} = \epsilon_i$, (A.1) can be written more compactly as $\|U\Sigma V^*\|_s$.

In [33], Rudelson and Vershynin provided a bound for the expectation of (A.1) when $U = V$. The following is Lemma 3.8 in [33]:

Lemma A.1 Let the vectors $v_i$ and the matrices $V$ and $\Sigma$ be defined as above, and suppose that $\|v_i\|_\infty \le M$. Then for some constant $C$,

\[
\mathrm{E}\|V\Sigma V^*\|_s \le C\cdot M\cdot\sqrt{s}\cdot\log s\,\sqrt{\log m\,\log n}\cdot\|VV^*\|_s^{1/2}. \tag{A.2}
\]

The following is the analogous result for the more general case when $U \neq V$:

Lemma A.2 Let $V, \Sigma$ be as in Lemma A.1, and let $U$ be another $n\times m$ matrix whose maximum entry is less than $M$. Then for some constant $C$,

\[
\mathrm{E}\|U\Sigma V^*\|_s \le C\cdot M\cdot\sqrt{s}\cdot\log s\,\sqrt{\log m\,\log n}\cdot\left(\|VV^*\|_s^{1/2} + \|UU^*\|_s^{1/2}\right).
\]



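Proof As in [33], we can bound $\|U\Sigma V^*\|_s$ by the supremum of a Gaussian random process. Letting $\{g_i\}$ be a sequence of iid Gaussian random variables with zero mean and unit variance, we have

\[
\begin{aligned}
\mathrm{E}\|U\Sigma V^*\|_s &= \mathrm{E}\sup_{|\Gamma|\le s}\;\sup_{x_a,x_b\in B_2^\Gamma}\Big|\sum_{i=1}^{m}\epsilon_i\langle x_a,u_i\rangle\langle v_i,x_b\rangle\Big| \\
&\le C\,\mathrm{E}\sup_{|\Gamma|\le s}\;\sup_{x_a,x_b\in B_2^\Gamma}\Big|\sum_{i=1}^{m} g_i\langle x_a,u_i\rangle\langle v_i,x_b\rangle\Big|.
\end{aligned} \tag{A.3}
\]

We now apply the Dudley inequality (see, for example, [35, Chapter 2]), which states that for a Gaussian process $G(x)$ indexed by a set $x\in T$, the expected maximum value of $G$ over $T$ obeys

\[
\mathrm{E}\sup_{x\in T}|G(x)| \le C\int_0^\infty \log^{1/2} N(T,\delta,t)\,dt. \tag{A.4}
\]

Above, $N(T,\delta,t)$ is the $t$-covering number for $T$ under the metric $\delta(x,y) = (\mathrm{E}[|G(x) - G(y)|^2])^{1/2}$. The process in (A.3) is indexed by two vectors $x_a, x_b$, so here

\[
G(x_a,x_b) = \sum_i g_i\langle x_a,u_i\rangle\langle v_i,x_b\rangle, \quad\text{and}\quad T = \bigcup_{|\Gamma|\le s} B_2^\Gamma\otimes B_2^\Gamma,
\]

with the metric $\delta$ given by

\[
\delta\big((x_a,x_b),(y_a,y_b)\big) = \left(\sum_{i=1}^{m}\big|\langle x_a,u_i\rangle\langle v_i,x_b\rangle - \langle y_a,u_i\rangle\langle v_i,y_b\rangle\big|^2\right)^{1/2}.
\]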
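We can bound this distance using

\[
\begin{aligned}
\delta\big((x_a,x_b),(y_a,y_b)\big) &= \frac{1}{2}\left(\sum_{i=1}^{m}\big|\langle x_a+y_a,u_i\rangle\langle v_i,x_b-y_b\rangle + \langle x_a-y_a,u_i\rangle\langle v_i,x_b+y_b\rangle\big|^2\right)^{1/2} \\
&\le \frac{1}{2}\cdot\max_i\max\big(|\langle u_i,x_a-y_a\rangle|, |\langle v_i,x_b-y_b\rangle|\big)\left(\sum_{i=1}^{m}\big(|\langle x_a+y_a,u_i\rangle| + |\langle x_b+y_b,v_i\rangle|\big)^2\right)^{1/2} \\
&\le R\cdot\max_i\max\big(|\langle u_i,x_a-y_a\rangle|, |\langle v_i,x_b-y_b\rangle|\big),
\end{aligned}
\]

where

\[
\begin{aligned}
R^2 &= \frac{1}{4}\sup_{(x_a,x_b),(y_a,y_b)\in T}\;\sum_{i=1}^{m}\big(|\langle x_a+y_a,u_i\rangle| + |\langle x_b+y_b,v_i\rangle|\big)^2 \\
&\le \|UU^*\|_s + \|VV^*\|_s + \frac{1}{2}\sup_{(x_a,x_b),(y_a,y_b)\in T}\;\sum_{i=1}^{m}|\langle x_a+y_a,u_i\rangle|\cdot|\langle x_b+y_b,v_i\rangle| \\
&\le \|UU^*\|_s + \|VV^*\|_s + \frac{1}{2}\sup_{(x_a,x_b),(y_a,y_b)\in T}\left(\sum_{i=1}^{m}|\langle x_a+y_a,u_i\rangle|^2\right)^{1/2}\left(\sum_{i=1}^{m}|\langle x_b+y_b,v_i\rangle|^2\right)^{1/2} \\
&\le \|UU^*\|_s + \|VV^*\|_s + 2\,\|UU^*\|_s^{1/2}\|VV^*\|_s^{1/2} \\
&= \left(\|UU^*\|_s^{1/2} + \|VV^*\|_s^{1/2}\right)^2 =: R'^2.
\end{aligned}
\]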
Defining the norms

\[
\|x\|_U = \max_i|\langle x,u_i\rangle| \quad\text{and}\quad \|x\|_V = \max_i|\langle x,v_i\rangle|,
\]

our bound on the metric becomes

\[
\delta\big((x_a,x_b),(y_a,y_b)\big) \le R'\max\big(\|x_a-y_a\|_U, \|x_b-y_b\|_V\big).
\]

Now let

\[
T' = \bigcup_{|\Gamma|\le s} B_2^\Gamma \tag{A.5}
\]

and note that $T\subset T'\otimes T'$, and so $N(T,\delta,t) \le N(T'\otimes T',\delta,t)$. If $\mathcal{C}_1$ is a $t$-cover for $T'$ under the metric $\|\cdot\|_U$ and $\mathcal{C}_2$ is a $t$-cover for $T'$ under the metric $\|\cdot\|_V$, then $\mathcal{C}_1\otimes\mathcal{C}_2$ is a $t$-cover for $T'\otimes T'$ under the metric $\max(\|\cdot\|_U, \|\cdot\|_V)$. Hence

\[
N(T,\delta,t) \le N(T',R'\|\cdot\|_U,t)\cdot N(T',R'\|\cdot\|_V,t)
\]

and

\[
\int_0^\infty\log^{1/2}N(T,\delta,t)\,dt \le R'\int_0^\infty\log^{1/2}N(T',\|\cdot\|_U,t)\,dt + R'\int_0^\infty\log^{1/2}N(T',\|\cdot\|_V,t)\,dt. \tag{A.6}
\]

We can now apply estimates for the covering numbers in (A.6) that were developed in [33], where the following is shown.
Proposition A.3 Let $x_1,\ldots,x_m$ be vectors in $\mathbb{C}^n$ with $\|x_j\|_\infty \le M$, and define the norm $\|x\|_X = \max_j|\langle x,x_j\rangle|$. For $T'$ as in (A.5),

\[
\int_0^\infty\log^{1/2}N(T',\|\cdot\|_X,t)\,dt \le C\cdot M\cdot\sqrt{s}\cdot\log s\,\log^{1/2}m\,\log^{1/2}n
\]

for some constant $C$.

Combining this proposition with (A.6) yields

\[
\mathrm{E}\|U\Sigma V^*\|_s \le C\cdot R'\cdot M\cdot\sqrt{s}\cdot\log s\,\log^{1/2}m\,\log^{1/2}n,
\]

establishing the lemma. ∎

A.2 A concentration inequality

The following is a specialized version of [23, Th. 6.17], and appears in the following form in [37, Prop. 19].

Lemma A.4 Let $V_1,\ldots,V_m$ be a sequence of square matrices with $\|V_i\|_s \le M$, and let $\{\epsilon_i\}$ be a Rademacher sequence. Set $V = \sum_i\epsilon_iV_i$. Then for all $u,t \ge 1$,

\[
\mathrm{P}\{\|V\|_s > C(u\,\mathrm{E}\|V\|_s + tM)\} \le e^{-u^2} + e^{-t}. \tag{A.7}
\]

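B A simple inequality

Lemma B.1 Fix $\alpha \le 1$ and $c \ge 0$. If

\[
\beta \le \alpha\left(c + \sqrt{\beta+1}\right) \quad\text{for } \beta \ge 0, \tag{B.1}
\]

then

\[
\beta \le \alpha\left(c + 1/2 + \sqrt{c + 5/4}\right) \quad\text{for } \beta \ge 0.
\]

Proof Let $x = (\beta+1)^{1/2}$; note that $x$ is a monotonic function of $\beta$. Then (B.1) becomes

\[
x^2 - 1 \le \alpha(c + x) \quad\Longrightarrow\quad x^2 - \alpha x - (\alpha c + 1) \le 0.
\]

The polynomial on the left is strictly increasing when $x \ge \alpha/2$. Since $\alpha \le 1$ and $x \ge 1$ for $\beta \ge 0$, it is strictly increasing over the entire domain of interest. Thus

\[
x^2 - \alpha x - (\alpha c + 1) \le 0 \quad\Longrightarrow\quad x \le \frac{\alpha + \sqrt{\alpha^2 + 4(\alpha c + 1)}}{2}.
\]

Substituting $(\beta+1)^{1/2}$ back in for $x$, this means

\[
\beta + 1 \le \frac{\alpha^2}{4} + \frac{\alpha\sqrt{\alpha^2 + 4(\alpha c + 1)}}{2} + \frac{\alpha^2 + 4(\alpha c + 1)}{4},
\]

and so

\[
\beta \le \alpha\left(c + \frac{\alpha}{2} + \sqrt{\frac{\alpha^2}{4} + \alpha c + 1}\right) \le \alpha\left(c + 1/2 + \sqrt{c + 5/4}\right)
\]

when $\alpha \le 1$.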
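Lemma B.1 can be sanity-checked numerically. The sketch below computes the largest $\beta \ge 0$ that satisfies (B.1), obtained from the positive root of the quadratic in the proof, and compares it with the bound stated in the lemma over a grid of $\alpha \in (0,1]$ and $c \ge 0$. The function names are hypothetical, and the grid is only an illustration, not a proof.

```python
import math

def largest_beta(alpha, c):
    """Largest beta >= 0 with beta <= alpha * (c + sqrt(beta + 1)).

    With x = sqrt(beta + 1), this is the positive root of
    x^2 - alpha * x - (alpha * c + 1) = 0.
    """
    x = (alpha + math.sqrt(alpha ** 2 + 4.0 * (alpha * c + 1.0))) / 2.0
    return x * x - 1.0

def lemma_bound(alpha, c):
    """Right-hand side of the conclusion of Lemma B.1."""
    return alpha * (c + 0.5 + math.sqrt(c + 1.25))

# Largest amount by which the extremal beta exceeds the lemma's bound on the grid.
worst_gap = max(
    largest_beta(a / 100.0, c / 4.0) - lemma_bound(a / 100.0, c / 4.0)
    for a in range(1, 101)
    for c in range(0, 41)
)
```

On this grid `worst_gap` should be zero up to rounding, with equality attained at $\alpha = 1$, consistent with the final step of the proof.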